Abstract

The ad hoc coordination problem is to design
an autonomous agent which is able to achieve
optimal flexibility and efficiency in a multiagent
system with no mechanisms for prior
coordination. We conceptualise this problem
formally using a game-theoretic model, called
the stochastic Bayesian game, in which the
behaviour of a player is determined by its private
information, or type. Based on this model,
we derive a solution, called Harsanyi-Bellman
Ad Hoc Coordination (HBA), which utilises
the concept of Bayesian Nash equilibrium in
a planning procedure to find optimal actions
in the sense of Bellman optimal control. We
evaluate HBA in a multiagent logistics domain
called level-based foraging, showing that
it achieves higher flexibility and efficiency than
several alternative algorithms. We also report
on a human-machine experiment at a public
science exhibition in which the human participants
played repeated Prisoner’s Dilemma and
Rock-Paper-Scissors against HBA and alternative
algorithms, showing that HBA achieves
equal efficiency and a significantly higher welfare
and winning rate.


1 Introduction 

We are concerned with the ad hoc coordination problem, 
in which the goal is to design an autonomous agent, 
called the ad hoc agent, which is able to achieve optimal 
flexibility and efficiency in a multiagent system that 
admits no prior coordination between the ad hoc agent 
and the other agents. Flexibility describes the ad hoc 
agent’s ability to solve its task with a variety of other 
agents in the system. Efficiency is the relation between 
the ad hoc agent’s payoffs and time needed to solve the 
task. No prior coordination means that the ad hoc agent
does not know ahead of time who the other agents are
and how they behave. In particular, there are no prior 
agreements on information sharing, communication and 
action protocols, standards, etc. 


This problem is motivated by the fact that there is a 
growing number of agents, both robotic and virtual, 
which are employed in an increasing number of areas. 
Given that a primary goal in agents research is to increase
the autonomy and thus lifetime of agents, it can
be expected that agents based on different technologies
may have to interact in nontrivial ways, without
knowing a priori who the other agents are. This motivates
both the notion of flexibility, since the other agents
could be based on any kind of technology, and efficiency,
since there may be no time for long learning periods,
especially if interactions are sparse. Human-machine
interaction problems (e.g. robots used in rescue scenarios
or software agents used in trading markets) can be
viewed as a special case of ad hoc coordination, since
humans have extremely variable behaviour (flexibility)
and expect agents to be able to interact quickly (efficiency),
while there may be no prior description of the
human’s behaviour (no prior coordination).


There have been several attempts to address ad hoc
coordination in multiagent systems, e.g. [Bowling and
McCracken, 2005; Dias et al., 2006; Stone et al., 2010a].
While all of these works are relevant to ad hoc coordination,
the assumptions made by the solutions therein imply
that they only address certain aspects of the larger
problem. For example, in [Bowling and McCracken,
2005; Dias et al., 2006] it is assumed that all agents
follow pre-specified plans which include roles and synchronised
action sequences for each role, and in [Stone
and Kraus, 2010; Stone et al., 2010b; Barrett et al.,
2011; Agmon and Stone, 2012] it is assumed that the
other agents’ behaviours are fixed and known, and that
all agents have common payoffs. We also note that the
problem descriptions in these works are of a procedural
nature, associated with the specific tasks considered
therein. Therefore, there is a need for a formal model
of the ad hoc coordination problem, general enough to
accommodate a wide spectrum of problems.


A related problem is known in game theory as the incomplete
information game. Therein, each player has
some private information relevant to its decision making
of which the other players are not aware, which is
what relates the incomplete information game to the ad
hoc coordination problem. [Harsanyi, 1967] introduced
Bayesian games in which the private information of a
player is abstractly represented by its type, admitting a
solution in the form of the Bayesian Nash equilibrium.
Since then, there have been several works on learning
in Bayesian games, e.g. [Jordan, 1991; Kalai and Lehrer,
1993; Dekel et al., 2004]. While the notion of private
information is useful to describe the ad hoc coordination
problem, the learning processes and solutions studied
therein are not directly applicable, since the focus has
traditionally been on equilibrium considerations but
not on efficiency. On the other hand, much work in multiagent
systems has focused on efficiency, whilst often
making central assumptions about the other agents’
behaviours [Albrecht and Ramamoorthy, 2012]. Therefore,
it is natural to ask if these fields can be combined
to address ad hoc coordination in a useful way.


Inspired by this question, we model the problem using a
game-theoretic construct called the stochastic Bayesian
game, in which a player’s behaviour is determined by
its type. Based on this model, we give formal definitions
of flexibility and efficiency, and we define ad hoc
coordination as the problem of optimising flexibility
and efficiency, subject to the constraint that the ad hoc
agent is unaware of the players’ type spaces, and hence
the rules by which their types are assigned. Our model
allows for both the definition of Bayesian Nash equilibrium
and, since it satisfies the Markov property, the
definition of Bellman optimal control [Bellman, 1957],
a key result in intelligent agents. We combine these two
concepts to obtain a solution which we call Harsanyi-Bellman
Ad Hoc Coordination (HBA). HBA does not
rely on a central assumption about the other agents’
behaviours. Instead, it allows for the specification of
multiple such assumptions which are provided to HBA
as a set of user-defined types, each corresponding to
a different hypothesis of how an agent might behave.
Based on the agents’ observed actions, HBA computes
probability distributions over the user-defined types,
called posteriors, and utilises them in a planning procedure
to find optimal actions.


HBA has a number of useful features with respect to ad
hoc coordination. The fact that the user-defined types
may encapsulate any kind of behaviour means that HBA
can potentially deal with a variety of different agents,
including agents which maintain beliefs about the behaviour
of the HBA agent, or any other type of recursive
reasoning. We show this in a human-machine experiment
conducted at a public science exhibition, in which
HBA was able to manipulate the beliefs of humans in
repeated Prisoner’s Dilemma such that both ended up
cooperating, thus maximising its efficiency. HBA also
supports the possibility that agents may switch between
different behaviours. We address this by introducing
temporally reweighted posteriors which allow HBA to
quickly recognise changed types. In our human-machine
experiment, this allowed HBA to achieve a significantly
higher winning rate in Rock-Paper-Scissors than the
human participants and an alternative algorithm.


A central feature of HBA is that it can use the types
to plan in the entire state space of the problem (including
unseen states) provided that the posteriors and
user-defined types are reasonably accurate. To accommodate
the case in which none of the user-defined types
accurately describe an agent’s behaviour, HBA is able
to include methods for opponent modelling. We propose
an opponent modelling method, called conceptual type,
which can be viewed as a kind of type that specifies
the conceptualisation underlying a behaviour, rather
than specifying the behaviour directly. The conceptualisation
is combined with the observed actions of an
agent to generalise its actions to unseen states and improve
accuracy in rarely visited states. We demonstrate
these features in a multiagent logistics domain called
level-based foraging, in which HBA is able to achieve
significantly higher flexibility and efficiency than three
alternative algorithms (JAL [Claus and Boutilier, 1998],
CJAL [Banerjee and Sen, 2007], WoLF-PHC [Bowling
and Veloso, 2002]), using just a few user-defined types.


2 Defining Ad Hoc Coordination 


2.1 Stochastic Bayesian Games 

As discussed earlier, ad hoc coordination can be defined
based on the notion of private information in Bayesian
games. However, in their original form [Harsanyi, 1967],
Bayesian games are not descriptive enough to allow us
to model the kinds of problems we are interested in, as
they include neither states nor time. Therefore, we
combine Bayesian games with the concept of stochastic
games [Shapley, 1953] to obtain a more descriptive
model which we call stochastic Bayesian games.¹

Definition 1. A stochastic Bayesian game (SBG) consists of:

• discrete state space $S$ with initial state $s^0 \in S$ and
terminal states $\bar{S} \subset S$

• players $N = \{1, \ldots, n\}$ and for each $i \in N$:

– set of actions $A_i$ (where $A = A_1 \times \ldots \times A_n$)

– type space $\Theta_i$ (where $\Theta = \Theta_1 \times \ldots \times \Theta_n$)

– payoff function $u_i : S \times A \times \Theta_i \to \mathbb{R}$

– strategy $\pi_i : \mathbb{H} \times A_i \times \Theta_i \to [0,1]$

• state transition function $T : S \times A \times S \to [0,1]$

• type distribution $\Delta : \mathbb{N}_0 \times \Theta \to [0,1]$

$\mathbb{H}$ contains all histories $H^t = \langle s^0, a^0, s^1, a^1, \ldots, s^t \rangle$ with
$t \geq 0$, $(s^\tau, a^\tau) \in S \times A$ for $0 \leq \tau < t$, and $s^t \in S$.

¹ A related model is the I-POMDP, in which agents face
incomplete information with respect to the state of the world
and the behaviour of other agents [Gmytrasiewicz and Doshi,
2005]. However, I-POMDPs are extremely complex and their
solution methods are infeasible in most problems.


We also define several classes of type distributions:

Definition 2. A type distribution $\Delta$ is called static if
$\forall t, t' \in \mathbb{N}_0, \forall \theta \in \Theta : \Delta(t, \theta) = \Delta(t', \theta)$, else it is called dynamic.

Definition 3. A type distribution $\Delta$ is called pure if
$\forall t\ \exists \theta \in \Theta : \Delta(t, \theta) = 1$, else it is called mixed.


An SBG starts at time $t = 0$ in state $s^0$. In state $s^t$, the
types $\theta_1^t, \ldots, \theta_n^t$ are sampled from $\Theta$ with probability
$\Delta(t, (\theta_1^t, \ldots, \theta_n^t))$, and each player $i \in N$ is only informed
about its own type $\theta_i^t$. Based on the history $H^t$, each
player $i$ chooses an action $a_i^t \in A_i$ with probability
$\pi_i(H^t, a_i^t, \theta_i^t)$. Given the joint action $a^t = (a_1^t, \ldots, a_n^t)$,
the game transitions into a successor state $s^{t+1} \in S$
with probability $T(s^t, a^t, s^{t+1})$ and every player $i$ receives
an individual payoff given by $u_i(s^t, a^t, \theta_i^t)$. This
process is repeated until the game reaches a terminal
state $s^t \in \bar{S}$, after which the game stops.
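The play of an SBG described above can be sketched as a short simulation loop. This is a minimal illustration, not the authors' implementation: the function names (`sample`, `run_sbg`), the dictionary-based representation of probability distributions, and the `max_steps` cut-off are all assumptions introduced here.

```python
import random

def sample(dist, rng):
    """Draw one outcome from a dict {outcome: probability}."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def run_sbg(s0, terminal, type_dist, strategies, transition, payoffs,
            max_steps=100, seed=0):
    """Simulate one play of a stochastic Bayesian game (hypothetical API).

    type_dist(t)               -> {type_profile: prob}   (type distribution)
    strategies[i](H, theta_i)  -> {action_i: prob}       (strategy of player i)
    transition(s, a)           -> {next_state: prob}     (transition function)
    payoffs[i](s, a, theta_i)  -> float                  (payoff of player i)
    """
    rng = random.Random(seed)
    n = len(strategies)
    history = [s0]                 # H^t = <s^0, a^0, s^1, ..., s^t>
    totals = [0.0] * n
    s, t = s0, 0
    while s not in terminal and t < max_steps:
        theta = sample(type_dist(t), rng)           # types are re-sampled each step
        h = tuple(history)
        a = tuple(sample(strategies[i](h, theta[i]), rng) for i in range(n))
        for i in range(n):
            totals[i] += payoffs[i](s, a, theta[i])
        s = sample(transition(s, a), rng)
        history.extend([a, s])
        t += 1
    return history, totals
```

Each player's strategy only receives its own component of the sampled type profile, which mirrors the requirement that players are informed only about their own types.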


Our definition of types follows the original definition 
of [Harsanyi, 1967], which means that a type determines
a player’s payoffs and strategies. However, since we 
define strategies with respect to a history of states and 
actions (rather than just the current state), a type may 
in fact specify strategies which change over time (such 
as players who learn or use recursive reasoning), and 
we thus also refer to it as behaviour. Therefore, our 
interpretation of types is that of a “programme” which 
governs the behaviour of a player. 


Each player may correspond to a specific role in the
game. For instance, if we model a soccer team, player 1
may correspond to the goalkeeper. Therefore, in the
following sections, we implicitly assume that the ad hoc
agent, denoted $\alpha$, controls the player of interest, denoted
$i$, by which we mean that $\alpha$ chooses the strategy $\pi_i$.
Furthermore, $i$ has a fixed type which is known to $\alpha$,
and we denote its payoffs by $u_i(s^t, a^t, \alpha)$.


2.2 Flexibility & Efficiency 

Two important aspects of ad hoc coordination are flexibility
and efficiency. We now define each of them formally
within the SBG model. The definitions rely on
the notion of paths and probabilities of paths:


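Definition 4. A path $p$ in SBG $\Gamma$ is a sequence
$\langle s_p^0, \theta_p^0, a_p^0, s_p^1, \theta_p^1, a_p^1, \ldots, s_p^{t_p} \rangle$ where $s_p^\tau \in S$, $\theta_p^\tau \in \Theta$,
$a_p^\tau \in A$, and $s_p^0 = s^0$. A path $p$ is terminating if $s_p^{t_p} \in \bar{S}$,
otherwise it is non-terminating. Given a type distribution
$\Delta$ for $\Gamma$, the probability of path $p$ is defined as

$$\Pr(p \mid \Gamma, \Delta) = \prod_{\tau=0}^{t_p - 1} \Delta(\tau, \theta_p^\tau)\, T(s_p^\tau, a_p^\tau, s_p^{\tau+1}) \prod_{k \in N} \pi_k(H_p^\tau, (a_p^\tau)_k, (\theta_p^\tau)_k)$$

where $H_p^\tau$ is the history extracted from $p$ until time $\tau$.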

For $\Pr(p \mid \Gamma, \Delta)$ to be well-defined (i.e. there is a set $X$
with $\forall p \in X : \Pr(p \mid \Gamma, \Delta) \geq 0$ and $\sum_{p \in X} \Pr(p \mid \Gamma, \Delta) = 1$),
it is important to note the following two implications in
the definition of SBGs. Firstly, no path $p$ can be prefixed
by a terminating path, i.e., there is no $s_p^\tau \in p$ such that
$\tau < t_p$ and $s_p^\tau \in \bar{S}$. This is important since otherwise
$\Pr(p \mid \Gamma, \Delta)$ might assign positive probability to a path
which is prefixed by a terminating path and, thus, could
never occur. Secondly, the only paths that can occur are
either terminating (and hence finite) or non-terminating
and infinite (i.e. $t_p \to \infty$). Thus, if $\Phi$ is the set of all
terminating paths and $\Psi$ the set of all infinite non-terminating
paths, then $\sum_{p \in \Phi \cup \Psi} \Pr(p \mid \Gamma, \Delta) = 1$.

Based on the notion of paths, we define the flexibility 
and efficiency of ad hoc agent a as follows: 

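Definition 5. Let $\Phi$ be the set of all terminating paths
in SBG $\Gamma$. Given a set of type distributions $D$ for $\Gamma$,
the flexibility $F(\alpha \mid \Gamma, D)$ and efficiency $E(\alpha \mid \Gamma, D)$ of $\alpha$
in $\Gamma$ with respect to $D$ are defined as

$$F(\alpha \mid \Gamma, D) = \frac{1}{|D|} \sum_{\Delta \in D} \sum_{p \in \Phi} \Pr(p \mid \Gamma, \Delta)$$

$$E(\alpha \mid \Gamma, D) = \frac{1}{|D|} \sum_{\Delta \in D} \sum_{p \in \Phi} \widehat{\Pr}(p \mid \Gamma, \Delta)\, \frac{\left[ \sum_{\tau=0}^{t_p - 1} u_i(s_p^\tau, a_p^\tau, \alpha) \right]^{r_1}}{(t_p)^{r_2}}$$

where $\widehat{\Pr}(p \mid \Gamma, \Delta) = \Pr(p \mid \Gamma, \Delta) / \sum_{p' \in \Phi} \Pr(p' \mid \Gamma, \Delta)$, and $r_1, r_2 \geq 1$
specify the relative importance between payoff and time.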


$F(\alpha \mid \Gamma, D)$ and $E(\alpha \mid \Gamma, D)$ can be interpreted as, respectively,
the average probability that $\alpha$ solves a task in
$\Gamma$ and the average payoff per time step $\alpha$ received in
solved tasks, where $D$ specifies all constellations of types
that can occur. There may be problems in which flexibility
is not a relevant metric because termination is
guaranteed for some reason. In such cases, the primary
metric is efficiency.


2.3 The Ad Hoc Coordination Problem 

We are now in a position to formally define the ad hoc
coordination problem. The core aspect is that there is
no prior coordination between the ad hoc agent and the
other agents in the system. We express this formally
by requiring that the ad hoc agent does not know the
type spaces $\Theta_j$ of the other players and, therefore, the
type distribution $\Delta$ of the game.

Definition 6. Let $\Gamma$ be an SBG with type spaces $\Theta_j$,
and let $D$ be a set of type distributions for $\Gamma$. The ad
hoc coordination problem is to optimise the flexibility
$F(\alpha \mid \Gamma, D)$ and efficiency $E(\alpha \mid \Gamma, D)$ of ad hoc agent $\alpha$,
subject to the constraint that $\alpha$ does not know $\Theta_j$ (and,
therefore, the type distributions $\Delta$).

Algorithm 1 Evaluation procedure

Input: SBG $\Gamma$, set of type distributions $D$,
       ad hoc agent $\alpha$, player $i$ (to be controlled by $\alpha$)
Output: flexibility $F(\alpha \mid \Gamma, D)$, efficiency $E(\alpha \mid \Gamma, D)$

$F \leftarrow 0$
$E \leftarrow 0$
Repeat $K$ times:
    Randomly draw type distribution $\Delta \in D$
    Generate path $p$ in $\Gamma$ with $\Delta$ ($\alpha$ controls $i$)
    If $p$ terminates do
        $F \leftarrow F + 1$
        $E \leftarrow E + \left[ \sum_\tau u_i(s_p^\tau, a_p^\tau, \alpha) \right]^{r_1} \cdot (t_p)^{-r_2}$
$F(\alpha \mid \Gamma, D) \leftarrow F / K$
$E(\alpha \mid \Gamma, D) \leftarrow E / K$

Computing $F(\alpha \mid \Gamma, D)$ and $E(\alpha \mid \Gamma, D)$ exactly is infeasible
for all but the simplest games. We propose to approximate
these by using the procedure given in Algorithm 1.
The procedure generates $K$ samples $F_k \sim F(\alpha \mid \Gamma, D)$
and $E_k \sim E(\alpha \mid \Gamma, D)$, based on which it approximates
$F(\alpha \mid \Gamma, D) \approx \frac{1}{K} \sum_k F_k$ and $E(\alpha \mid \Gamma, D) \approx \frac{1}{K} \sum_k E_k$.
Since all $F_k$ and $E_k$, respectively, come from the same
distribution, by the law of large numbers this will converge
to the true values of $F(\alpha \mid \Gamma, D)$ and $E(\alpha \mid \Gamma, D)$ for
$K \to \infty$. The procedure needs some means to determine
if a path is non-terminating. This could be done,
for instance, by checking if the path reached a subset of
the state space which contains no terminal states and cannot be
left anymore, or by setting a maximum path length.
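The sampling procedure can be sketched in a few lines. This is a hedged illustration only: `generate_path` is a hypothetical callback standing in for "play one game with the ad hoc agent controlling player i", and the per-path score (summed payoff raised to `r1`, divided by path length raised to `r2`) follows the efficiency definition as it is read here. The sketch assumes every terminating path has length at least 1, and that `r1` is an integer whenever the summed payoff can be negative.

```python
import random

def evaluate(generate_path, type_distributions, K=1000, r1=1, r2=1, seed=0):
    """Monte Carlo estimate of (flexibility, efficiency).

    generate_path(delta, rng) -> (terminated, payoffs, length)   [hypothetical]
      terminated: True if the sampled path reached a terminal state
      payoffs:    list of the controlled player's payoffs along the path
      length:     number of time steps t_p of the path (assumed >= 1)
    """
    rng = random.Random(seed)
    F = 0.0
    E = 0.0
    for _ in range(K):
        delta = rng.choice(type_distributions)   # uniform draw from D
        terminated, payoffs, length = generate_path(delta, rng)
        if terminated:
            F += 1.0
            E += (sum(payoffs) ** r1) / (length ** r2)
    return F / K, E / K
```

Non-termination has to be decided inside `generate_path`, for example with a maximum path length, as discussed above.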


3 Harsanyi-Bellman Ad Hoc Coordination


The problem of incomplete information is solved in
Bayesian games by assuming that the type spaces $\Theta_j$
and type distribution $\Delta$ are common knowledge. This
admits a solution in the form of the Bayesian Nash
equilibrium [Harsanyi, 1968], here defined for SBGs:


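Definition 7. Let $H^t$ be the history at time $t$ and
define $\Theta_{-i} = \times_{j \neq i} \Theta_j$. A Bayesian Nash equilibrium
(BNE) in state $s^t$ is a strategy profile $(\pi_1, \ldots, \pi_n)$ in
which, for all $i \in N$ and $\theta_i \in \Theta_i$, $\pi_i$ maximises

$$\sum_{\theta_{-i} \in \Theta_{-i}} \Delta(t, \theta_{-i} \mid \theta_i) \sum_{a \in A} u_i(s^t, a, \theta_i)\, \pi(H^t, a, (\theta_i, \theta_{-i})) \qquad (1)$$

where

$$\Delta(t, \theta_{-i} \mid \theta_i) = \frac{\Delta(t, (\theta_i, \theta_{-i}))}{\sum_{\theta'_{-i} \in \Theta_{-i}} \Delta(t, (\theta_i, \theta'_{-i}))}$$

$$\pi(H^t, a, \theta) = \prod_{k \in N} \pi_k(H^t, a_k, \theta_k).$$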


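In ad hoc coordination problems, the ad hoc agent
does not know the type spaces $\Theta_j$ and, hence, the
type distribution $\Delta$ of the game. Therefore, it cannot
compute $\Delta(t, \theta_{-i} \mid \theta_i)$. However, using the history $H^t$, it
can compute a posterior $\Pr(\theta_{-i} \mid H^t) = \prod_{j \neq i} \Pr(\theta_j \mid H^t)$
with $\Pr(\theta_j \mid H^t)$ being the probability that player $j$ has
type $\theta_j$ based on history $H^t$:

$$\Pr(\theta_j \mid H^t) = \frac{L(H^t \mid \theta_j)\, P(\theta_j)}{\sum_{\hat{\theta}_j \in \Theta_j} L(H^t \mid \hat{\theta}_j)\, P(\hat{\theta}_j)} \qquad (2)$$

where $L(H^t \mid \theta_j) = \prod_{\tau=0}^{t-1} \pi_j(H^\tau, a_j^\tau, \theta_j)$ is the probability
of history $H^t$ if the type of player $j$ is $\theta_j$, and $P(\theta_j)$
is the agent’s prior belief that player $j$ has type $\theta_j$.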
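As an illustration of the posterior computation, the sketch below multiplies the probabilities that each hypothesised type assigns to the observed actions and normalises with the prior. The representation of types as Python callables and the fallback to the prior when every likelihood is zero are assumptions made for this example, not part of the definition.

```python
def posterior(prior, types, observations):
    """Posterior over hypothesised types of one player, product-likelihood form.

    prior:        {type_name: prior probability}
    types:        {type_name: policy(history, action) -> probability}
    observations: list of (history, action) pairs for tau = 0, ..., t-1
    """
    scores = {}
    for name, policy in types.items():
        likelihood = 1.0
        for history, action in observations:
            likelihood *= policy(history, action)   # product over time steps
        scores[name] = likelihood * prior[name]
    z = sum(scores.values())
    if z == 0.0:
        # Assumption for this sketch: if no type explains the data,
        # keep the prior rather than dividing by zero.
        return dict(prior)
    return {name: score / z for name, score in scores.items()}
```

For example, with a type that cooperates with probability 0.9 and a type that acts uniformly at random, two observed cooperations shift a uniform prior to roughly 0.76 in favour of the cooperative type.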


[Kalai and Lehrer, 1993] studied single-state SBGs (with
static pure type distributions) with players who choose
actions to maximise their expected long-term payoff.
They have shown that, if player $i$ maintains a posterior
according to (2), and if the type distribution $\Delta$ is absolutely
continuous with respect to the posterior (i.e.,
$\Delta(t, (\theta_i, \theta_{-i})) > 0 \Rightarrow \Pr(\theta_{-i} \mid H^t) > 0$), then player $i$’s
predictions of future play will eventually be correct, regardless
of player $i$’s own strategy (Theorem 1 in [Kalai
and Lehrer, 1993]). It follows that, if all players maintain
such posteriors (where $\Delta$ is absolutely continuous with
each posterior), and if all players choose their strategies
according to a modified version of (1) which replaces
the immediate payoff with the expected long-term payoff,
then play will converge to a Nash equilibrium (NE)
of the game (Theorem 2 in [Kalai and Lehrer, 1993]). A
similar result was shown by [Jordan, 1991] for myopic
players (i.e. maximising immediate payoffs).


While these are encouraging theoretical results, there
are several potential objections concerning the use of
NE: Firstly, if there are multiple NE, then the players
may converge to a sub-optimal equilibrium. Secondly, a
NE is incomplete in that it does not specify strategies
for off-equilibrium paths. Finally, [Dekel et al., 2004]
have shown that if the posteriors of the players are not
identical, then they might converge to a solution which
is not a NE. However, our main concern with NE is
that it makes strong behavioural assumptions about the
players’ behaviours (such as perfect rationality) which
may be difficult to justify in ad hoc coordination. For
instance, there is no guarantee that all players maintain
posteriors according to (2). The same arguments
hold for solution concepts in extensive form games,
such as the perfect Bayesian equilibrium and sequential
equilibrium [Fudenberg and Tirole, 1991].


Rather than attempting to converge to NE, it is appealing
to use (1) as a best-response rule, since it maximises
the expected payoff with respect to what types the ad
hoc agent believes the other players to have and their
strategies for all types. Based on Theorem 1 in [Kalai
and Lehrer, 1993], we know that the agent’s beliefs, and
hence its expected payoffs, will be correct after some
time. However, in its current form, (1) only considers
immediate payoffs whereas optimal behaviour may
require an agent to take payoffs of future states into account.
Therefore, we propose to combine (1) with the
Bellman optimality equation [Bellman, 1957] to obtain
a best-response rule which we call Harsanyi-Bellman
Ad Hoc Coordination. Since ad hoc coordination requires
that the agent does not know the type spaces
$\Theta_j$, we assume instead that the ad hoc agent is provided
with user-defined type spaces $\Theta_j^*$, and we sometimes
refer to $\Theta_j$ as the true type spaces.

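Definition 8. Let $\Gamma$ be an ad hoc coordination problem
where ad hoc agent $\alpha$ controls player $i$ and has
access to user-defined type spaces $\Theta_{-i}^* = \times_{j \neq i} \Theta_j^*$.
Harsanyi-Bellman Ad Hoc Coordination (HBA) is defined
as $a_i^t \sim \arg\max_{a_i} E_{s^t}^{a_i}(H^t)$, where

$$E_s^{a_i}(\hat{H}) = \sum_{\theta_{-i}^* \in \Theta_{-i}^*} \Pr(\theta_{-i}^* \mid H^t) \sum_{a_{-i} \in A_{-i}} Q_s^{a_{i,-i}}(\hat{H}) \prod_{j \neq i} \pi_j(\hat{H}, a_j, \theta_j^*)$$

is the expected long-term payoff for player $i$ of taking
action $a_i$ in state $s$ after history $\hat{H}$ ($a_{i,-i} = (a_i, a_{-i})$),
and

$$Q_s^a(\hat{H}) = \sum_{s' \in S} T(s, a, s') \left[ u_i(s, a, \alpha) + \gamma \max_{a_i} E_{s'}^{a_i}(\langle \hat{H}, a, s' \rangle) \right] \qquad (3)$$

is the expected long-term payoff for player $i$ when joint
action $a$ is executed in state $s$ after history $\hat{H}$, with
$0 \leq \gamma \leq 1$ being the discount factor.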


HBA is a modification of (1) which replaces $\Delta(t, \theta_{-i} \mid \theta_i)$
by the posterior $\Pr(\theta_{-i}^* \mid H^t)$ (2), and in which the immediate
payoff $u_i$ is replaced by an altered version (3)
of the Bellman optimality equation. The actual history
$H^t$ is used to compute the posterior, and the projected
histories $\hat{H}$ are used to generate all future trajectories.
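To make the recursion concrete, the following sketch evaluates the expected long-term payoff of each action by exhaustive lookahead, truncated at a fixed depth. It is an illustration under several assumptions that are not in the original: the controlled player is at index 0 of every joint action, the game is passed as a plain dictionary with hypothetical keys (`A_i`, `A_others`, `T`, `u`, `gamma`, `terminal`), each type is a callable `policy(history, action)`, and the recursion stops after `depth` steps rather than running to termination. The posterior is held fixed during planning, since it depends only on the actual history.

```python
from itertools import product

def expected(hist, a_i, depth, game, post, types):
    """Expected long-term payoff of own action a_i after projected history hist."""
    total = 0.0
    for profile, p_theta in post.items():          # profile: tuple of type names
        if p_theta == 0.0:
            continue
        for a_others in product(*game["A_others"]):
            p_a = 1.0
            for theta_j, a_j in zip(profile, a_others):
                p_a *= types[theta_j](hist, a_j)   # probability of the others' actions
            if p_a > 0.0:
                joint = (a_i,) + a_others
                total += p_theta * p_a * q_value(hist, joint, depth, game, post, types)
    return total

def q_value(hist, a, depth, game, post, types):
    """Expected long-term payoff of joint action a in the last state of hist."""
    s = hist[-1]
    total = 0.0
    for s_next, p in game["T"](s, a).items():
        future = 0.0
        if depth > 1 and s_next not in game["terminal"]:
            h_next = hist + (a, s_next)            # projected history
            future = max(expected(h_next, b, depth - 1, game, post, types)
                         for b in game["A_i"])
        total += p * (game["u"](s, a) + game["gamma"] * future)
    return total

def hba_action(hist, depth, game, post, types):
    """Return the greedy action and the value of every own action."""
    values = {a_i: expected(hist, a_i, depth, game, post, types)
              for a_i in game["A_i"]}
    return max(values, key=values.get), values
```

In a repeated Prisoner's Dilemma with a two-step lookahead, this sketch prefers cooperation against a hypothesised tit-for-tat type and defection against an always-defect type, because the projected histories let the type's future reaction enter the value of the current action.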

Each user-defined type $\theta_j^* \in \Theta_j^*$ is a hypothesis about
the behaviour of player $j$. While this gives HBA great
flexibility (as $\Theta_j^*$ may include a variety of behaviours),
it is important to note that the accuracy of (3), and
hence efficiency of HBA, depends on how closely the
user-defined types capture the players’ true types. In
this respect, we state two useful properties of HBA:

Proposition 1. Let $\Gamma$ be an SBG with static pure type
distribution $\Delta$. If all players $i \in N$ are controlled by an
HBA agent $\alpha_i$ with user-defined type spaces $\Theta_j^{*,i}$, and
if $\forall j \neq i : \Theta_j \subseteq \Theta_j^{*,i}$, then play will converge to NE.


This follows from Theorems 1 and 2 in [Kalai and
Lehrer, 1993] together with the fact that $\Theta_j \subseteq \Theta_j^{*,i}$ for
all $i$ and $j$ (with $i \neq j$), which means that the type
distribution $\Delta$ is always absolutely continuous with respect
to the players’ posteriors. Note that, while this
proposition does not directly relate to ad hoc coordination,
it does guarantee the minimum requirements of
convergence and optimality in self-play, as formulated
in [Bowling and Veloso, 2002].


For the next proposition, we define the class of deterministic
learners, denoted $\Theta^d$, which consists of
all types $\theta_j$ where, for all times $t$ and histories $H^t$,
there exists a unique sequence $(\chi_{a_j})_{a_j \in A_j}$ such that
$\pi_j(\langle H^t, (a, s) \rangle, a_j, \theta_j) + \chi_{a_j} = \pi_j(H^t, a_j, \theta_j)$, for all
$(a, s) \in A \times S$. In other words, a deterministic learner always
learns the same from a given history. By definition,
this includes all fixed (i.e. non-changing) behaviours.

Proposition 2. Let $\Gamma$ be an SBG with static pure type
distribution $\Delta$, where $\alpha$ controls $i$. If $\Theta_j \subseteq \Theta^d \wedge \Theta_j \subseteq \Theta_j^*$,
then $\alpha$ will be optimally efficient.


This follows from the fact that there is some point after
which HBA knows the players’ types (Theorem 1
in [Kalai and Lehrer, 1993]) and, since all types are deterministic
learners, the expected payoffs (3) are correct.
Since HBA chooses actions with maximum expected
payoffs, according to the Bellman principle [Bellman,
1957], it follows that it achieves optimal efficiency. Note
that HBA is itself a deterministic learner, hence HBA
achieves optimal efficiency in self-play.


Both propositions assume that (3) can be implemented
directly, which is often infeasible. In the experimental
sections, we show how HBA can be implemented as a reinforcement
learning procedure and an exact planning procedure.


3.1 Temporally Reweighted Posteriors 

A potential problem with the posterior defined in
(2) is that it assigns zero probability to a type $\theta_j$ if
$\pi_j(H^\tau, a_j^\tau, \theta_j)$ is zero for any $\tau$. This can be problematic
for the following reasons: If the game uses a dynamic or
mixed type distribution, and if $\Pr(\theta_j \mid H^t) = 0$ for a type
$\theta_j$ that is not currently the true type of player $j$, then
$\Pr(\theta_j \mid H^{t'}) = 0$ for all times $t' > t$, even if player $j$’s type
changes to $\theta_j$. Furthermore, if we have a user-defined
type $\theta_j^*$ which approximates the true type $\theta_j$ of player $j$
in a subset $S^* \subset S$ (i.e. $\pi_j(H^t, a_j, \theta_j^*) \approx \pi_j(H^t, a_j, \theta_j)$
for $s^t \in S^*$), but not outside $S^*$, then (2) might assign
zero probability to $\theta_j^*$ once player $j$ leaves $S^*$. However,
$\theta_j^*$ may be the best approximation we have for $S^*$, so
it would be useful if (2) was able to quickly reassign
positive probability to $\theta_j^*$ once player $j$ returns to $S^*$.
To address these problems, we introduce temporally
reweighted posteriors:

Definition 9. A temporally reweighted posterior (TR-posterior)
is defined as in (2) by redefining

$$L(H^t \mid \theta_j) = \sum_{\tau=0}^{t-1} f(t - \tau)\, \pi_j(H^\tau, a_j^\tau, \theta_j) \qquad (4)$$

where $f(x) \geq 0$ and $f(x) \geq f(x + 1)$, for all $x \in \mathbb{N}^+$.

The function $f$ is called the time weight and can assume
various forms. An example of a simple but useful time
weight, called the general time weight, is given by $f(x) =
\max[0, a - b(x - 1)^c]$ where $a, b, c \in \mathbb{R}_0^+$. This time weight
can be used to produce various behaviours, depending
on the parameters $a, b, c$. In particular, it can be used
to give greater importance to more recent events, which
means that HBA is able to quickly reassign probabilities.
However, the crucial aspect of (4) is that it defines a sum
rather than a product, which means that the problems
described above do not occur.
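The effect of replacing the product by a weighted sum can be seen in a small sketch. The exact form of the general time weight used below, max(0, a - b(x - 1)^c), is a reading of the definition and should be treated as an assumption, as are the function names and the fallback to the prior when all scores vanish.

```python
def general_time_weight(a, b, c):
    """Time weight f(x) = max(0, a - b * (x - 1) ** c), for x = 1, 2, ..."""
    return lambda x: max(0.0, a - b * (x - 1) ** c)

def tr_likelihood(probs, f):
    """Weighted sum over time steps.

    probs[tau] is the probability the type assigned to the action
    observed at time tau, for tau = 0, ..., t-1.
    """
    t = len(probs)
    return sum(f(t - tau) * p for tau, p in enumerate(probs))

def tr_posterior(prior, probs_by_type, f):
    """Temporally reweighted posterior over hypothesised types."""
    scores = {name: tr_likelihood(probs, f) * prior[name]
              for name, probs in probs_by_type.items()}
    z = sum(scores.values())
    if z == 0.0:
        return dict(prior)   # assumption: keep the prior if nothing scores
    return {name: score / z for name, score in scores.items()}
```

A type that assigned probability zero to an early action, but predicts the most recent actions well, keeps positive posterior mass here, whereas a product of the same probabilities would stay at zero forever.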


3.2 Conceptual Types 


If the user-defined type space $\Theta_j^*$ for player $j$ does not
include the true type space $\Theta_j$ (i.e. $\Theta_j \not\subseteq \Theta_j^*$), then $j$
might assume a type which is unknown to HBA, causing
its expected payoffs to be inaccurate. In such cases,
it would be useful if HBA was able to learn new types
from experience. This opens up the possibility of using
methods for opponent modelling (e.g. case-based
reasoning [Wendler and Bach, 2004] or recursive modelling
[Gmytrasiewicz and Durfee, 2000]) which can be
included in $\Theta_j^*$. In this work, we use a combination of
case-based reasoning and fictitious play [Brown, 1951]
called conceptual types. Conceptual types are based on
the observation that behaviour may not be specified
on a state-by-state basis but rather on abstractions of
state spaces. (An example is the “information sets”
in extensive form games.) That is, there may be some
world conceptualisation inherent in a behaviour. While
the types in $\Theta_j^*$ are used to hypothesise behaviours
directly, a conceptual type can be used to hypothesise
a world conceptualisation underlying a player’s
behaviour. Combined with the player’s observed actions,
this can be used to generalise actions to unseen
states and increase accuracy in rarely visited states.


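Definition 10. A conceptual type (c-type) for player
$j$ is a tuple $(d_j, r, f)$, where $d_j : S \times S \to \mathbb{R}_0^+$ is a
symmetric distance function for pairs of states, $r \in \mathbb{R}^+$
is a radius, and $f$ is a time weight (as defined in
Section 3.1), with

$$\pi_j(H^t, a_j, (d_j, r, f)) = \begin{cases} |A_j|^{-1} & \text{if } \nexists\, \tau < t : g(s^t, s^\tau) > 0 \\ \eta \sum_{\tau < t \,:\, a_j^\tau = a_j} f(t - \tau)\, g(s^t, s^\tau) & \text{else} \end{cases}$$

where $g(s_1, s_2) = \max[0, 1 - d_j(s_1, s_2)\, r^{-1}]$ and $\eta$ is a
normalisation constant s.t. $\sum_{a_j} \pi_j(H^t, a_j, (d_j, r, f)) = 1$.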

The function $g$ is the hypothesised world conceptualisation
of player $j$, where $d_j$ and $r$ specify how similar two
states are from the perspective of player $j$ (examples are
given in the experiments). The time weight $f$ can be used to
give greater importance to recent events, which allows
c-types to adapt quickly to changing behaviours. Note
that we can include multiple c-types in $\Theta_j^*$, each corresponding
to a different world conceptualisation, and
the posterior filters out those types which do not fit.
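A minimal sketch of how a c-type could turn observed actions into a prediction is given below. It assumes that the predicted probability of an action is the normalised sum of time-weighted similarities between the current state and the past states in which that action was observed, and that the prediction is uniform when no past state is similar. The function name, argument layout, and the uniform fallback when all weighted scores are zero are assumptions of this example.

```python
def ctype_policy(past_states, past_actions, s_now, action_set, dist, r, f):
    """Predicted action distribution of a conceptual type (illustrative).

    past_states[tau], past_actions[tau]: state and player j's action at time tau
    dist(s1, s2): symmetric distance between states
    r:            radius beyond which two states count as unrelated
    f:            time weight, called with the age t - tau of an observation
    """
    t = len(past_states)
    similarity = lambda s1, s2: max(0.0, 1.0 - dist(s1, s2) / r)
    scores = {a: 0.0 for a in action_set}
    for tau in range(t):
        scores[past_actions[tau]] += f(t - tau) * similarity(s_now, past_states[tau])
    z = sum(scores.values())
    if z == 0.0:
        # No similar (or sufficiently recent) experience: predict uniformly.
        return {a: 1.0 / len(action_set) for a in action_set}
    return {a: score / z for a, score in scores.items()}
```

With integer states, absolute difference as the distance, and radius 2, an action seen in states 0 and 1 is predicted with certainty in state 0, while a state far from all experience yields the uniform distribution.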


4 Simulated Experiments 

4.1 Experimental Setup 

We evaluated different configurations of HBA in a multiagent
logistics domain called level-based foraging (see
Figure 1). A level-based foraging problem consists of a
rectangular grid with $n$ players and $m$ foods. Each field
in the grid is either empty or occupied by one player
or one food. All players and foods have a level ($\in \mathbb{N}^+$)
where no food has a level greater than the sum of any
4 players’ levels. A player can choose among 5 actions:
N, E, S, W, and load. The first 4 actions move the
player into the corresponding direction if the field is
empty and inside the grid. A group of 1 to 4 players
can load a food if they are placed on fields next to the
food and if the sum of their levels is at least as high
as the food’s level. A player which successfully loads a
food obtains a payoff equal to the level of the loaded
food. At all other times, it receives a negative payoff
of -0.01. To avoid conflicts and keep the problem solvable, the
foods are placed such that the Euclidean distance between
each of them is greater than 1, and no food is
placed at any border of the grid. The players’ goal is to
collect all foods in minimal time, while also trying to
maximise their own payoffs. Since the players have different
abilities (i.e. levels) and are spatially distributed,
this requires strong coordination of their behaviours.

Figure 1: Level-based foraging domain. Players are
marked by circles and foods are marked by squares
(the levels are shown inside). Left: Each player can load
a food. Right: No player can load a food.
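The loading rule of the domain can be written as a small predicate. This sketch assumes that "next to the food" means the four orthogonally adjacent fields (consistent with at most 4 players loading one food) and that only players who chose the load action are passed in; the function name and data layout are hypothetical.

```python
def can_load(food_pos, food_level, loaders):
    """Check whether a group of loading players can load a food.

    food_pos:   (x, y) position of the food
    food_level: level of the food
    loaders:    list of ((x, y), level) for players that chose `load`
    """
    fx, fy = food_pos
    # Keep only players on one of the four fields adjacent to the food.
    adjacent = [level for (x, y), level in loaders
                if abs(x - fx) + abs(y - fy) == 1]
    return len(adjacent) >= 1 and sum(adjacent) >= food_level
```

Two players of levels 1 and 2 standing next to a level-3 food can load it together, whereas either of them alone cannot, which is the source of the coordination requirement.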


We specify 6 classes of types. The first 4 classes contain
types with fixed behaviours (i.e. they do not change over
time). They each have a parameter which specifies
the radius of their sight: H1 always goes to the closest
visible food. H2 goes to the one visible food which is
closest to the centre of all visible players. H3 always
goes to the closest visible food with compatible level
(i.e. it can load it) and H4 goes to the one visible food
which is closest to all visible players such that the sum
of their and H4’s level is sufficient to load the food. H1-4
try to load the food once they are next to it. If they do
not see a food, they go into a random direction. The
last two classes specify types with learning behaviours:
Class 5 contains all instances of JAL and class 6 all
instances of CJAL, as specified in the next paragraph.


We evaluated various configurations of HBA and three
alternative algorithms: JAL [Claus and Boutilier, 1998]
learns the action frequencies of each player in each state
(i.e. opponent modelling) and uses them to compute
expected action payoffs; CJAL [Banerjee and Sen, 2007]
is similar to JAL but learns the frequencies conditioned
on its own actions; WoLF-PHC [Bowling and Veloso,
2002] is a hill-climbing method in the space of mixed
strategies. All three algorithms behave differently in ad
hoc coordination [Albrecht and Ramamoorthy, 2012].


A single framework (Algorithm 2) was used to implement
each ad hoc agent. We assume that the ad hoc
agent is able to observe the states of the game, each
player’s actions, and its own payoffs. For simplicity, we
also assume that the agent knows the levels of all players
and foods. The framework uses a table $Q$ to learn
the expected long-term payoffs of joint actions, similar
to Q-learning [Watkins and Dayan, 1992]. To accelerate
learning, it uses an eligibility trace $e$ (see [Sutton
and Barto, 1998]) to connect current payoffs with past
actions. We assume that the agent has access to a simulator
Simulate($s$, $a$) which, based on the transition ($T$)
and payoff ($u_i$) functions of the game, returns a successor
state $s'$ and payoff $u$ after taking joint action $a$ in
state $s$. This simulator is used in a sampling-based planning
procedure [Kearns et al., 1999] Expand($d$, $s$, $e$)
which, starting in state $s$, generates a future trajectory
of length $d$ and updates $Q$ using the eligibility trace $e$.
The function ExpPay($Q$, $s$, $a_i$) computes the expected
payoff for taking action $a_i$ in state $s$ based on $Q$, and the
function OppActions($s$) samples actions for all other
players $j \neq i$ in state $s$. HBA implements Expand using
(3) and its posterior, and OppActions using its
posterior and user-defined types. C/JAL implement
these functions using their learned action frequencies.


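Algorithm 2 Reinforcement learning framework

Set $Q(s, a) \leftarrow 0$ and $e(s, a) \leftarrow 0$ for all $(s, a) \in S \times A$
Repeat until $s^t \in \bar{S}$:
    Observe: current state $s^t$
    With probability $1 - \epsilon_1$: $a_i^t$ = ChooseAction($s^t$),
        else sample $a_i^t \sim A_i$
    Observe: joint action $a^t$, own payoff $u_i^t$, next state $s^{t+1}$
    UpdateQ($s^t$, $a^t$, $u_i^t$, $s^{t+1}$, e)
    Repeat x times: Expand(d, $s^{t+1}$, Copy(e))

Expand(d, s, e):
    Repeat d times or until $s \in \bar{S}$:
        With probability $1 - \epsilon_2$: $a_i$ = ChooseAction(s),
            else sample $a_i \sim A_i$
        $a_{-i} \leftarrow$ OppActions(s)
        $(u_i, s') \leftarrow$ Simulate(s, $(a_i, a_{-i})$)
        UpdateQ(s, $(a_i, a_{-i})$, $u_i$, $s'$, e)
        $s \leftarrow s'$

UpdateQ(s, a, u, s', e):
    $\delta = \beta\,(u + \gamma \max_{a_i} \mathrm{ExpPay}(Q, s', a_i) - Q(s, a))$
    $e(s, a) \leftarrow 1$
    For all $(\hat{s}, \hat{a}) \in S \times A$ s.t. $e(\hat{s}, \hat{a}) > e_{\min}$ do:
        $Q(\hat{s}, \hat{a}) \leftarrow Q(\hat{s}, \hat{a}) + \delta\, e(\hat{s}, \hat{a})$
        $e(\hat{s}, \hat{a}) \leftarrow \lambda\, e(\hat{s}, \hat{a})$

ChooseAction(s):
    Return $a_i \sim \arg\max_{a_i} \mathrm{ExpPay}(Q, s, a_i)$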
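
The UpdateQ step of the framework can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: Q and e are stored as dictionaries keyed by (state, action) pairs, and the `exp_pay` argument is a hypothetical stand-in for ExpPay that simply reads Q (the variant described for WoLF-PHC).

```python
from collections import defaultdict


def update_q(Q, e, s, a, u, s_next, exp_pay, own_actions,
             beta=0.2, gamma=0.9, lam=0.9, e_min=0.01):
    """One UpdateQ step: temporal-difference error spread along the trace e."""
    best_next = max(exp_pay(Q, s_next, ai) for ai in own_actions)
    delta = beta * (u + gamma * best_next - Q[(s, a)])
    e[(s, a)] = 1.0  # the pair just visited becomes fully eligible
    for key in list(e):
        if e[key] > e_min:  # pairs with negligible trace are skipped
            Q[key] += delta * e[key]
            e[key] *= lam


# Toy run with two transitions; states and actions are plain strings.
Q, e = defaultdict(float), defaultdict(float)
exp_pay = lambda Q, s, ai: Q[(s, ai)]  # assumption: ExpPay reads Q directly
own_actions = ["a0", "a1"]
update_q(Q, e, "s0", "a0", 1.0, "s1", exp_pay, own_actions)
update_q(Q, e, "s1", "a1", 1.0, "s2", exp_pay, own_actions)
```

After the second call, the payoff observed in s1 has also been credited to the earlier pair (s0, a0) in proportion to its decayed trace, which is how the trace connects current payoffs with past actions.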


For WoLF-PHC, the framework defines Q and e on
$S \times A_i$ (rather than $S \times A$) and ExpPay(Q, s, $a_i$) is
simply defined as $Q(s, a_i)$. Since WoLF-PHC does not
model its opponents, we implement OppActions the
same way as in JAL. The function ChooseAction(s)
is redefined to $a_i \sim \pi(s)$, where $\pi$ is the mixed strategy
maintained in WoLF-PHC (cf. Tables 5 and 6 in [Bowling
and Veloso, 2002]).

All algorithms used identical parameters: $\beta = 0.2$, $\gamma = 0.9$,
$\lambda = 0.9$, $e_{\min} = 0.01$, $\epsilon_1 = 0$, $\epsilon_2 = 0.2$, $x = 3$, $d = 20$. For
WoLF-PHC, we used learning rates $\delta_w(t) = (1000 + t/10)^{-1}$
and $\delta_l(t) = 2\delta_w(t)$. For HBA, we used uniform
prior beliefs ($P(\theta_j^*) = |\Theta_j^*|^{-1}$) and $a = 10$, $b = 0.01$,
$c = 3$ for the general time weight. To obtain estimates
of flexibility and efficiency, we used Algorithm 1 with
$i = 1$, $r_1 = r_2 = 1$, $K = 1000$, where we assumed a path
to be non-terminating if it reached $t = 1000$. The initial
states were generated with random positions and levels
for all players and foods, with the maximum level being
equal to the number of players. All agents were tested
on the same sequence of games and random numbers.

4.2 Results 

Figure 2: Results of simulated experiments, averaged over 1000 runs. Markers have the same colour if the difference
is statistically insignificant (based on paired t-test with 5% significance level). “Cor” is HBA with correct types,
“Gtw” is HBA using TR-posterior with general time weight, “Uni” is HBA with unlimited normal posterior, and
“Lim” is HBA with normal posterior limited to 9 most recent events.


We tested the effectiveness of TR-posteriors by simulating
the two situations described in Section 3.1. All
tests were run on an 8 x 8 grid with 2 players and 5 foods.
In one test (Figure 2), we used $\Theta_2 = \Theta_2^* = \{\text{H1-H4} \mid \sigma = \infty\}$
and a dynamic pure type distribution which changed
the type of player 2 after every 10 to 20 time steps.
In the other test (Figure 2), we used $\Theta_2 = \{\text{H1-H4} \mid \sigma = 3, 5, 7\}$,
$\Theta_2^* = \{\text{H1-H4} \mid \sigma = \infty\}$ (i.e. the types in $\Theta_2^*$ were accurate
only for subsets $S^* \subset S$) and a static pure type
distribution. In both cases, the efficiency of HBA was
significantly higher when using a TR-posterior with general
time weight (Gtw) compared to both the normal
posterior (Uni) and a normal posterior
which was limited to the 9 most recent events (Lim),
which is the same time frame used in Gtw. In one of
these tests (Figure 2), Gtw even achieved the same efficiency as a
version of HBA which always knew the correct type
of the other player (Cor). All HBA agents achieved a
perfect flexibility of 1.

We tested HBA with 4 conceptual types $\theta_j^c = (d_j^c, r, f)$
where $f(\zeta) = [\zeta < 10]_1$ and $r = 1$. In the following,
we write $s.p_j$ ($s.f_k$) to refer to the position of player $j$
(food $f_k$) in state $s$, and $f_k \in s$ to say that food $f_k$ is
available in state $s$. The distance functions $d_j^c$ are

d](si,S2) = [si ^ S2]l 00 

d'j{si,S2)=[si.pj = S2.Pj AVk : fk G Si fk G S2]iOO 

dj{si,S2)=4'{si.pj,S2.Pj)+Y.k:ff,e>^^Vfkes2'^isi.fk,fJ.)~^ 

dj{si,S2)=d^{si,S2)+J2v Hsi-Pv, S2.Pv)0Jf^-^ 

where $\phi(x_1, x_2) = \log(1 + \psi(x_1, x_2))$, $p = s_1.p_j + \frac{1}{2}(s_2.p_j - s_1.p_j)$,
$\omega_v = \min[\psi(s_1.p_v, p), \psi(s_2.p_v, p)]$, and $\psi(x_1, x_2)$
denotes the Euclidean distance between $x_1$ and $x_2$. All
tests were run on an 8 x 8 grid with 2 players and 5 foods,
using $\Theta_2 = \{\text{H1-H4}, \text{JAL}, \text{CJAL} \mid \sigma = \infty\}$ (C/JAL used the
same parameters as HBA), $\Theta_2^* = \{\theta_2^c\}$ (each $c = 1, \ldots, 4$
tested separately), and a static pure type distribution.
The results in Figure 2 show that HBA achieved good
efficiency (compared to Cor) using while the other 
c-types were less efficient. All HBA agents achieved 
statistically equivalent flexibilities of 0.86 ± 0.01. 

Finally, we tested HBA, JAL, CJAL, and WoLF-PHC
on a 10 x 10 grid with 3 players and 8 foods, using
$\Theta_{2,3} = \{\text{H1-H4}, \text{JAL}, \text{CJAL} \mid \sigma = 5, 7, 9\}$ and
$\Theta_{2,3}^* = \{\text{H1-H4} \mid \sigma = \infty\}$. To add more realism, players 2 and
3 were “defective” with probability 0.2, where a defective
player changed its type randomly every 10 to 30
time steps. While the potential of HBA is demonstrated
by Cor, it would also be useful to know the optimal solution
to the problem. However, with a complex problem
such as this one, we were unable to compute optimal
solutions. Instead, we had 6 humans play the game in
a graphical user interface (each one played the full 1000
runs, distributed over 7 days at their own convenience),
where no human was familiar with the technical details
of this work. We do not necessarily claim that humans
produce optimal solutions, but we expect them to perform
consistently well in this setting. To cope with the
increased problem size, we set the planning power of
the algorithms to $x = 10$ and $d = 30$ (cf. Algorithm 2).
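
The “defective” player mechanism can be sketched as a type schedule. The function below is a hypothetical illustration: its name and arguments are not from the paper, and it assumes that a newly drawn type is sampled uniformly from the available types (so the same type may be drawn again).

```python
import random


def type_schedule(initial_type, types, steps, rng,
                  p_defective=0.2, min_gap=10, max_gap=30):
    """Return the type a player holds at each of `steps` time steps."""
    defective = rng.random() < p_defective  # decided once per player
    schedule, current = [], initial_type
    next_change = rng.randint(min_gap, max_gap) if defective else None
    for t in range(steps):
        if defective and t == next_change:
            current = rng.choice(types)  # assumption: uniform re-draw
            next_change = t + rng.randint(min_gap, max_gap)
        schedule.append(current)
    return schedule


types = ["H1", "H2", "H3", "H4"]
stable = type_schedule("H1", types, 100, random.Random(0), p_defective=0.0)
unstable = type_schedule("H1", types, 100, random.Random(0), p_defective=1.0)
```

A non-defective player keeps its initial type for the whole game, while a defective player keeps it only until the first change point, which occurs no earlier than step 10.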

The results (Figure 2) show that HBA clearly outperformed
all alternative algorithms, with Uni and Gtw
being over 100% and 200% more efficient, respectively.
This is despite the fact that the user-defined types $\Theta_{2,3}^*$
did not include any true types of the players. We also
tested HBA with two of the c-types (each added separately
to $\Theta_{2,3}^*$) but found that the efficiency of HBA did
not improve significantly. This is because C/JAL learned
similar behaviours to H1 and H3, which were already
covered in $\Theta_{2,3}^*$. We found that HBA’s posteriors often
assigned high probabilities to H1/3 when the true type
of the player was in fact C/JAL. Since H1/3 ignore
other players, this means that C/JAL did not effectively
coordinate their behaviours with other players.
We found similar results for WoLF-PHC. As was expected,
the humans achieved high efficiency (Figure 2
shows the best human) and outperformed even Cor.
One reason for this is the fact that the humans had
much greater planning power than HBA. Lastly, HBA
achieved higher flexibilities (0.83 ± 0.01) than JAL (0.734),
CJAL (0.749), and WoLF-PHC (0.744), while the humans
all achieved perfect flexibility (1.0).










5 Human-Machine Experiment 


5.1 Experimental Setup 


We conducted a large-scale human-machine experiment 
at the Royal Society Summer Science Exhibition 2012. 
Therein, the human participants played repeated Pris¬ 
oner’s Dilemma (PD) and Rock-Paper-Scissors (RPS) 
against HBA and alternative algorithms, where each 
game was played for 20 rounds. We collected data from 
427 participants, of which 186 played PD and 241 played 
RPS. The lowest and highest recorded ages were 9 and 
72, respectively, with an average age of about 17. 


A large public exhibition such as this one is an excellent 
testbed environment for ad hoc agents, since the visitors 
vary widely in factors such as age, intelligence, and be¬ 
haviour. However, in order to make statistically relevant 
comparisons, we required data from many participants. 
Therefore, the games needed to be simple enough so 
participants would understand them quickly, yet they 
also needed to be interesting in terms of coordination 
strategies. PD and RPS are two widely studied problems
in game theory which we believe satisfy these requirements.
In PD, the symmetric payoffs are $u_i(C,C) = 3$,
$u_i(D,D) = 1$, $u_i(C,D) = 0$, $u_i(D,C) = 5$. The problem
here is that the only NE, and hence stable outcome,
is at (D,D), while (C,C) is the only outcome that has
both the highest welfare (sum of payoffs) and fairness
(product of payoffs) but is unstable since the players
could deviate to obtain higher immediate payoffs. In
RPS, the payoffs are +1/0/-1 for won/even/lost games.
The only NE is for all players to play randomly. However,
even if humans attempt to play randomly, they
often fall back to patterns [Wagenaar, 1972] against
which the other player can coordinate its actions.
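
The claims about the PD payoffs can be checked directly. The following sketch enumerates the four pure outcomes using the payoffs stated above; the helper names are illustrative only, and the check is restricted to pure strategies.

```python
from itertools import product

ACTIONS = ("C", "D")
# Payoff u_i(own action, other's action); the game is symmetric.
U = {("C", "C"): 3, ("D", "D"): 1, ("C", "D"): 0, ("D", "C"): 5}


def payoffs(a1, a2):
    """Payoffs of player 1 and player 2 for the joint action (a1, a2)."""
    return U[(a1, a2)], U[(a2, a1)]


def is_nash(a1, a2):
    """True if neither player gains by unilaterally switching action."""
    u1, u2 = payoffs(a1, a2)
    stable_1 = all(payoffs(d, a2)[0] <= u1 for d in ACTIONS)
    stable_2 = all(payoffs(a1, d)[1] <= u2 for d in ACTIONS)
    return stable_1 and stable_2


outcomes = list(product(ACTIONS, repeat=2))
pure_nash = [o for o in outcomes if is_nash(*o)]
welfare = {o: sum(payoffs(*o)) for o in outcomes}
fairness = {o: payoffs(*o)[0] * payoffs(*o)[1] for o in outcomes}
```

Here `pure_nash` contains only (D,D), whereas (C,C) maximises both welfare (6) and fairness (9), in line with the discussion above.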


Our hypothesis for the experiment was that the human
would switch between several simple behaviours, as opposed
to having one complex behaviour. Therefore, we
modelled the problem as an SBG with a dynamic mixed
type distribution (unknown to us) which governed the
type of the human, and we provided HBA with a small
set of types (given in Tables 1 and 2) which we believed
the human could have. HBA did not use any
conceptual types.


The alternative algorithms were CJAL for PD, which
was shown to outperform both JAL and WoLF-PHC
in PD [Banerjee and Sen, 2007], and JAL for RPS,
which is guaranteed to converge to NE in self-play
in zero-sum games [Brown, 1951]. We implemented
all algorithms using a single framework (Algorithm 3),
where we set $l^* = 10$ for PD, $l^* = 1$ for RPS, and $t^* = 20$.
The function OppStrat($s^\tau$, $a^\tau$) returns the probability
that players $j \neq i$ choose actions $a_j^\tau$ in state $s^\tau$. HBA
implements this by averaging over all user-defined types
in $\Theta_j^*$ using its current posterior, and C/JAL do this
using their learned action frequencies.


Algorithm 3 Exact planning framework

Repeat:
    Observe current state $s^t$
    For all $a_i \in A_i$ do:
        $H(a_i) = \{ (s^t, a^t, \ldots, s^{t+l}, a^{t+l}) \mid a_i^t = a_i \}$
            where $l = \min[l^*, t^* - t] - 1$
        $E(a_i) = \sum_{h \in H(a_i)} \prod_{\tau=t}^{t+l} \mathrm{OppStrat}(s^\tau, a^\tau) \sum_{\tau=t}^{t+l} u_i(s^\tau, a^\tau)$
    Sample action $a_i^t \sim \arg\max_{a_i} E(a_i)$


While PD and
RPS have no states, we found that the performance
of C/JAL could be further improved by introducing
“artificial” states, which we simply defined as s* = 

(in the first round, C/JAL assumed the opponent to 
play randomly). HBA used uniform prior beliefs and 
the general time weight with $a = 10$, $b = 0.05$, $c = 3$.
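
The way HBA averages over user-defined types can be sketched for PD as follows. This is an illustration under several assumptions: it uses a plain product posterior (prior times the likelihood of the observed actions) rather than a TR-posterior, it includes only the AlwaysC and TitForTat types, it encodes a history as a list of (agent action, opponent action) pairs, and it evaluates a single step rather than full trajectories. All function names are hypothetical.

```python
ACTIONS = ("C", "D")
# Agent's payoff u_i(own action, opponent action), as given for PD in the text.
PAYOFF = {("C", "C"): 3, ("D", "D"): 1, ("C", "D"): 0, ("D", "C"): 5}


def always_c(history):
    """Opponent type that always cooperates."""
    return {"C": 1.0, "D": 0.0}


def tit_for_tat(history):
    """Opponent type that opens with C, then copies the agent's last action."""
    if not history:
        return {"C": 1.0, "D": 0.0}
    last_agent_action = history[-1][0]
    return {a: float(a == last_agent_action) for a in ACTIONS}


def product_posterior(types, prior, history):
    """Pr(type | history), proportional to prior times action likelihoods."""
    weights = {}
    for name, strategy in types.items():
        w = prior[name]
        for t, (_, opp_action) in enumerate(history):
            w *= strategy(history[:t])[opp_action]
        weights[name] = w
    total = sum(weights.values())
    if total == 0.0:
        return dict(prior)  # assumption: fall back to the prior if all types are ruled out
    return {name: w / total for name, w in weights.items()}


def opp_strat(types, posterior, history):
    """Predicted opponent action distribution, averaged over types."""
    dist = {a: 0.0 for a in ACTIONS}
    for name, strategy in types.items():
        for action, p in strategy(history).items():
            dist[action] += posterior[name] * p
    return dist


def expected_payoff(own_action, dist):
    """One-step expected payoff of own_action against the predicted distribution."""
    return sum(p * PAYOFF[(own_action, a)] for a, p in dist.items())


types = {"AlwaysC": always_c, "TitForTat": tit_for_tat}
prior = {name: 1.0 / len(types) for name in types}
history = [("C", "C"), ("D", "C"), ("D", "D")]
posterior = product_posterior(types, prior, history)
prediction = opp_strat(types, posterior, history)
```

In this toy history the opponent's defection in the third round rules out AlwaysC, so the posterior concentrates on TitForTat and the prediction for the next round follows that type. With deterministic types, a single unexpected action drives the likelihood to zero, which is one reason why the choice of posterior formulation matters.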

The procedure of the experiment was as follows: First,
we randomly sampled a participant from the set of
visitors who were currently at our exhibit. The participant
was then brought to a dedicated table with a chair
and a laptop on it. The laptop ran a programme, with 
an intuitive graphical user interface, which prompted 
the participant to choose between PD and RPS. The 
rules of the games were explained both textually in the 
programme and in person by one of our staff members 
to make sure the participant understood the rules. The 
game was then played in two matches, each lasting 20 
rounds. One of the matches was against HBA and the 
other match against C/JAL, but this was hidden from 
the participant and the order was chosen randomly. 
The programme displayed the current match, round,
and scores of all players, and also allowed the participant
to display the rules at any time. At the end of each round, the
participant was shown the actions and scores of both 
players, and at the end of each match, the participant 
was given a summary of the scores. 

5.2 Results 

In the following, all significance statements are based
on paired t-tests with 5% significance level. Figure 3
shows the results for PD and RPS.
In both games, the average total payoffs of HBA and
C/JAL were statistically equivalent. Since the time was
fixed to 20 rounds, this means that they achieved equal
efficiency. This is, in fact, a positive result considering
that C/JAL are strong candidates in PD/RPS. In addi¬ 
tion, as we discuss in the following, HBA behaved very 
differently from C/JAL, with beneficial side effects. 

Figure 3: Results of the human-machine experiment. Circles and whiskers correspond to mean, minimum, and
maximum values, respectively. The welfare plot in (a) shows the median value and 25%/75% percentiles.


In PD, the most desirable long-term outcome is (C,C)
since it is both welfare and fairness optimal, and since
it is a non-myopic equilibrium [Brams, 1993], meaning
that no player has a long-term incentive to deviate.
With this in mind, we point out that in over 28% of the
games, HBA and the human played (C,C) in at least
50% of the final 10 rounds of the game, while CJAL did
not achieve this in any game. Thus, HBA achieved a significantly
higher total welfare than CJAL (Figure 3).
This is despite the fact that neither of them was optimised
for social welfare. The reason for this is that
HBA was planning more accurately than CJAL. When
computing the expected payoffs $E(a_i)$, CJAL uses its
learned action frequencies to obtain probabilities for
each trajectory in $H(a_i)$. However, these probabilities
can only be accurate for states that have been visited 
frequently enough. Moreover, if a player changes its be¬ 
haviour, CJAL requires new evidence from all states to 
accurately reflect the change. On the other hand, HBA 
uses its posterior and types to compute probabilities of 
trajectories. Therefore, once HBA has an accurate pos¬ 
terior, it can use the types to accurately plan in the 
entire state space of the game, including unseen states. 
This also allows HBA to plan the effects of its actions 
on the other player, which means that HBA may take 
actions to manipulate the player’s decisions. Finally, if 
a player changes its behaviour, HBA only needs to up¬ 
date its posterior, which requires much less information 
than the update in CJAL. 


In RPS, the crucial question is whether a player is
winning or not. Interestingly, the winning rate of HBA
(53.71%) was significantly higher than the winning rate
of JAL (43.98%), as shown in Figure 3. While in PD
the good performance of HBA was due to its planning 
capabilities, in RPS this was not as relevant since the 
planning horizon was limited to trajectories of length 1. 
Rather, HBA’s good performance was due to the fact 
that it recognised changed behaviours faster than JAL. 
Indeed, in a game such as RPS, it can be expected that 
the human players change frequently between different 
strategies. This is confirmed by the statistics shown
in Figure 3, which show the average number of types
used by the human players and the average duration.
The statistics are based on HBA’s posteriors, where the
number of types for player $i$ in a play corresponds to the
number $q$ in $t_0 < t_1 < \ldots < t_q$ with $t_0 = 0$ and $t_q = 20$, for
which $\arg\max_{\theta_i} \Pr(\theta_i \mid H^\tau) \subseteq \arg\max_{\theta_i} \Pr(\theta_i \mid H^{\tau+1})$
for all $t_{y-1} < \tau < t_y$ and $y \in \{1, \ldots, q\}$, and where
the average duration is $\frac{1}{q} \sum_y t_y - t_{y-1}$. According to
these statistics, the human players had 4.45 types with 
a duration of 4.96 rounds in PD, and 8.25 types with a 
duration of 2.46 rounds in RPS. Clearly, with a dura¬ 
tion of only 2.46 rounds, planning was not as important 
as recognising changed types. By using TR-posteriors, 
HBA was able to do this effectively. 
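
A simplified version of these statistics can be computed from the sequence of most probable types. The sketch below assumes that the posterior has a unique maximiser in every round, so that a change of type is simply a change of the most probable type between consecutive rounds; ties between types are not handled, and the function name is hypothetical.

```python
from itertools import groupby


def type_statistics(map_types):
    """Number of consecutive type segments and their average length in rounds."""
    if not map_types:
        return 0, 0.0
    runs = [len(list(group)) for _, group in groupby(map_types)]
    return len(runs), sum(runs) / len(runs)


# Most probable type per round in a short, made-up play.
play = ["Copycat", "Copycat", "RetryIfWon", "RetryIfWon", "RetryIfWon", "Copycat"]
num_types, avg_duration = type_statistics(play)
```

For the made-up play above, three segments are found with an average duration of two rounds. Note that a type which recurs later in the play is counted again.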

6 Summary & Open Questions

This work is concerned with the ad hoc coordination 
problem, in which the goal is to design an autonomous 
agent (the ad hoc agent) which can achieve optimal 
flexibility and efficiency in a multiagent system in which 
the behaviour of the other agents is not a priori known. 
We make three important contributions to the ad hoc 
coordination problem: 

1. We propose a game-theoretic model, SBG, which
captures the notion of private information in the
form of types. Based on this model, we give formally
concise definitions of flexibility, efficiency, and the
ad hoc coordination problem. We also provide a 
procedure which can be used to estimate the ad 
hoc agent’s flexibility and efficiency. 

2. From this model, we derive a principled solution, 
HBA, which utilises a set of user-defined types 
in a planning procedure to find optimal actions 
in the sense of Bayesian Nash equilibrium and 
Bellman optimal control. We also propose two 
possible extensions which enable HBA to recognise 
changed types and learn new types. 

3. We show how HBA can be implemented as a rein¬ 
forcement learning and exact planning procedure, 
and we provide extensive empirical evaluations 
in a complex multiagent logistics domain and a 
large-scale human-machine experiment. Our results
show that HBA is both more flexible and
more efficient than alternative methods.
























The work presented in this paper provides a rich ground 
for future research, including the following open ques¬ 
tions: 

• Crucial design parameters of HBA are the user-defined
type spaces $\Theta_j^*$ provided to it. In this regard,
an important direction for future research
would be to analyse how closely $\Theta_j^*$ must approximate
$\Theta_j$ for HBA to be able to achieve optimal
flexibility and efficiency.

• Another design parameter of HBA is the posterior
$\Pr(\cdot \mid H^t)$, and in this work we discussed two
different formulations (the product posterior and
TR-posteriors). It would be interesting to explore 
alternative posterior formulations and to analyse 
the conditions under which they are guaranteed 
to converge to the type distribution of the game. 

• The prior belief P can be considered a meta¬ 
parameter of HBA (it is a parameter of the poste¬ 
rior, which in turn is a parameter of HBA), and in 
our experiments we assumed that the prior beliefs 
were uniform. An interesting question in this re¬ 
gard is whether HBA could automatically derive 
prior beliefs from the user-defined type spaces so 
as to further maximise its efficiency. 

• HBA currently assumes that an expert can pro¬ 
vide manually specified types for the problem at 
hand. However, this can be a cumbersome task 
in complex domains. Future work could investi¬ 
gate how HBA might generate useful types from 
the problem description so that the burden of hav¬ 
ing to manually specify types can be alleviated, or 
perhaps eliminated altogether. 

• Finally, as we employ HBA in increasingly complex 
problem domains, it becomes apparent that the 
type specifications, likewise, become increasingly 
complex. One way to reduce this type complexity 
might be to use a hierarchical type specification, in 
which types are structured into smaller sub-types. 
PD type

Definition 

AlwaysC

a\ = C 

TitForTat 

= C,a\= 

TitFor2Tats 

= C, al = C if = C else D 

Optimistic 

TZi{C, iJ‘) = 1 if t < 2 V a]-^ = C V /i = 0 else 0.2 -f O.Sct 

Pessimistic 

'Ki{D,H*) = 1 if t < 2 V = D else 0.2 J- [/r > 0]i0.8cr 


u = = c]u<j = j, Et=o[< = 


Table 1: PD types. $[b]_1 = 1$ iff $b$ is true, else 0.


RPS type 

Definition 

Copycat 

ofl^U{Ai),a\=a]-^ 

Retry IfWon 

a\ ~ U{Ai) if t = 0 V Ui{af~^) < 0 else a\ = 

i-focused(/i) 

■Kiioi, i7*) = g{ai, x)/ Y^ai&Ai x), x = min[t, h] 

h e {1,2} 

g{ai,x) = max[0,x - Er=i[®i~"' = o.i]i{x J- 1 - r)] 

j-focused(/i) 

of ~ argmaxa, Ea.eA, u^{ai, a,) 

h e {1,2} 

where 'Kj{aj,H*) is obtained using i-focused(/i) for j 


Table 2: RPS types. U is the uniform distribution. 